library(readr)
library(ggplot2)
library(dplyr)
# add any others that you need here
library(GGally)
library(gridExtra)
Download the toy data (pca_toy_data.csv) from Moodle and
read it into R:
pcadata <- read_csv('pca_toy_data.csv')
## Rows: 1000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (17): population, trait, SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, SNP7, SNP8,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Explore the toy dataset, with the goal of answering each of the questions below.
How many people are in this dataset? How many SNPs?
# look at first few rows
head(pcadata)
## # A tibble: 6 × 17
## population trait SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 -0.111 0 0 0 2 0 0 0 0 0 1
## 2 1 1.47 0 0 0 0 0 0 0 0 0 0
## 3 1 1.43 0 0 0 0 0 0 1 0 1 0
## 4 1 -1.72 0 0 0 1 0 0 0 1 0 0
## 5 1 -0.356 0 0 0 0 0 1 1 1 0 0
## 6 1 0.516 0 0 0 0 0 0 1 0 0 0
## # … with 5 more variables: SNP11 <dbl>, SNP12 <dbl>, SNP13 <dbl>, SNP14 <dbl>,
## # SNP15 <dbl>
# calculate dimensions
dim(pcadata)
## [1] 1000 17
There are are 1000 people and 15 SNPs
How many populations are represented? How many people belong to each population?
# count how many people are in each population
pcadata %>%
count(population)
## # A tibble: 2 × 2
## population n
## <dbl> <int>
## 1 1 500
## 2 2 500
There are two populations (Population 1 and Population 2), with 500 people from each.
Are there any SNPs that seem to be more common in one population than the other? By how much do the allele frequencies in the two populations differ for each SNP?
We can calculate the observed allele frequency for a SNP as follows:
# function to get empirical minor allele frequency
## count up how many minor alleles are observed
## divide by two times the number of people
get_MAF <- function(snp){
sum(snp)/(2*length(snp))
}
# get observed allele frequency for each population
pcadata %>%
group_by(population) %>%
summarize_all(get_MAF)
## # A tibble: 2 × 17
## population trait SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.268 0 0 0.298 0.291 0.293 0.102 0.107 0.094 0.095 0.097
## 2 2 0.997 1 1 0.512 0.506 0.504 0.11 0.085 0.097 0.092 0.111
## # … with 5 more variables: SNP11 <dbl>, SNP12 <dbl>, SNP13 <dbl>, SNP14 <dbl>,
## # SNP15 <dbl>
From this, we notice that the first two SNPs have extreme differences in the minor allele frequencies between Populations 1 and 2: no one in Population 1 has the minor allele at either SNP, whereas everyone in Population 2 has two copies.
We also see that SNPs 3–5 have different frequencies in the two populations, with the minor allele being more common (MAF \(\approx\) 50%) in Population 2 as opposed to Population 1 (MAF \(\approx\) 30%).
The remaining SNPs seem to be approximately equally frequent across the two populations, with a minor allele frequency of about 10% in both Population 1 and Population 2.
There is more than one way to run PCA in R. We’ll use similar code to what you may have seen in STAT 253: Statistical Machine Learning. If PCA is new to you, I recommend taking some time to read through these notes after class.
First, we’ll need to set up the data. Create a data frame called
geno that contains only the genotype information for each
of the SNPs (i.e., remove any other columns).
geno <- pcadata %>%
select(-population, -trait)
Then, run the code chunk below to perform PCA using the
prcomp function.
pca_out <- prcomp(geno, center = TRUE, scale = TRUE)
Open the help page for prcomp and read through the
description of what it returns. (Hint: see the “Value” section of the
help page.) Which component of the output contains the loadings? Which
part contains the scores?
The
rotationelement contains the loadings and thexelement contains the scores (the “value[s] of the rotated data”).
Based on what you learned from the help page, extract the loadings
and scores and save them as new objects called pca_loadings
and pca_scores, respectively. (Hint: use $ to
extract a particular component from a list, e.g.,
list$name_of_component.)
pca_loadings <- pca_out$rotation
pca_scores <- pca_out$x
Create a scatterplot comparing the scores for the first two PCs, with points colored by which population the individual belongs to.
pca_scores %>%
as.data.frame() %>% # convert pca_scores into a data frame for plotting
mutate(population = as.factor(pcadata$population)) %>% # add the population labels
ggplot(aes(x = PC1, y = PC2, color = population)) + # then plot
geom_point() +
scale_color_brewer(palette = 'Dark2')
What do you learn from this plot?
Although we didn’t give PCA any information about the populations that these individuals belonged to, we see that the first PC seems to separate individuals out into these two populations!
Another way to visualize the PC scores is to make what’s known as a parallel coordinates plot. This plot will allow us to look at all of our PCs at once. As above, we’ll color each of the individuals according to which population they belong to. I’ll give you most of the code for this one:
# parallel coordinates plot
pca_scores %>%
as.data.frame() %>% # convert pca_scores into a data frame for plotting
mutate(population = as.factor(pcadata$population)) %>% # add the population labels
ggparcoord(columns = 1:15, groupColumn = 'population', alpha = 0.2) + # plot the first 15 columns
theme_minimal() +
scale_color_brewer(palette = 'Dark2')
What do you learn from the parallel coordinates plot?
The parallel coordinates plot is a way of visualizing the scores for all of our PCs at once. Each line on this plot is one person, and tracing their path through the parallel coordinates plot tells us their scores for each PC. Coloring the plot by the population labels can highlight patterns in terms of which PCs seem to be capturing population membership. In this case, we can see that the first PC seems to be separating out the individuals in Population 1 (lower scores) from Population 2 (higher scores), whereas the mixture of colors for later PCs indicated that those PCs don’t seem to be capturing ancestry.
Note: in practice, some groups use parallel coordinates plots to decide how many PCs they want to adjust for in their GWAS models! Here’s an example from a study that I have worked with: Hispanic Community Health Study/Study of Latinos.
Next, let’s investigate which SNPs have the largest contributions to the first PC by looking at the loadings. Run the code chunk below to plot the loadings for PC1.
pca_loadings %>%
as.data.frame() %>%
mutate(index = seq_len(nrow(pca_loadings))) %>%
ggplot(aes(x = index, y = PC1)) +
geom_point() +
labs(xlab = 'SNP Number', ylab = 'Loadings for PC1') +
theme_minimal()
Which SNPs have the highest loadings for PC1? Which SNPs have the lowest loadings? Why do you think this is?
SNPs 1 and 2 have the highest loadings, followeed by SNPs 3–5. These are the SNPs that have different allele frequencies in the two populations.
Now, repeat this process for the second PC, third, and fourth PCs. What do you notice?
l2 <- pca_loadings %>%
as.data.frame() %>%
mutate(index = seq_len(nrow(pca_loadings))) %>%
ggplot(aes(x = index, y = PC2)) +
geom_point() +
labs(xlab = 'SNP Number', ylab = 'Loadings for PC2') +
theme_minimal()
l3 <- pca_loadings %>%
as.data.frame() %>%
mutate(index = seq_len(nrow(pca_loadings))) %>%
ggplot(aes(x = index, y = PC3)) +
geom_point() +
labs(xlab = 'SNP Number', ylab = 'Loadings for PC3') +
theme_minimal()
l4 <- pca_loadings %>%
as.data.frame() %>%
mutate(index = seq_len(nrow(pca_loadings))) %>%
ggplot(aes(x = index, y = PC4)) +
geom_point() +
labs(xlab = 'SNP Number', ylab = 'Loadings for PC4') +
theme_minimal()
grid.arrange(l2, l3, l4, ncol = 1)
There doesn’t seem to be as much of a distinct pattern for these later PCs in terms of particular SNPs having more weight.
As we’ve discussed in class, the principal components are ordered in terms of the amount of variance they explain in the original data. To see exactly what proportion of total variance each PC explains, we can create what’s known as a scree plot. The code chunk below calculates the proportion of variance explained and then plots this for each PC.
# extract variance of each PC
pca_var <- (pca_out$sdev)^2
# calculate proportion of variance explained
total_var <- sum(pca_var)
pve <- pca_var/total_var
# scree plot
pve %>%
as.data.frame() %>%
mutate(index = seq_len(length(pca_var))) %>%
ggplot(aes(x = index, y = pve)) +
geom_point() +
geom_line() +
labs(x = 'SNP Number', y = 'Percent of Variance Explained') +
theme_minimal()
What do you learn from this plot?
The first PC explains over 15% of the total variance. The percent of variance quickly drops (down to 5–7%) when we look at the remaining PCs.
Note: looking at the percent of variance explained by each PC is another tool that researchers use to decide how many PCs to include in their GWAS models. This is sometimes referred to as the “elbow” method since the idea is to look at a scree plot, find the “elbow” in the curve, and then use all the PCs prior to that point where the line flattens out.
Now, let’s investigate the impact of adjusting for PCs in GWAS models.
First, conduct a genome-wide association study using marginal
regression models that do not make any adjusting for population
structure. In other words, use models of the form
trait ~ snp with no other covariates.
# empty vector to store p-values
pvals <- c()
# loop through the 15 SNPs
for(i in 1:15){
dat <- pcadata %>% select(trait, paste0('SNP',i)) # pull out just the trait and SNP of interest
mod <- lm(trait ~ ., data = dat) # regress trait on everything else (.), which is just SNP i in this case
pvals[i] <- summary(mod)$coef[2,4] # pull out the p-value for the slope
}
# plot -log10 pvalues
data.frame(p = pvals, SNP = 1:15) %>%
ggplot(aes(y = -log10(p), x = SNP)) +
geom_point() +
theme_minimal() +
ggtitle('Unadjusted Analysis')
What do you notice? Which SNP(s) have the smallest p-value?
In this analysis, SNP3 has the smallest p-value, and the p-values for SNPs 1 and 2 are also quite small.
Next, conduct a genome-wide association study using models that
adjust for the first PC (trait ~ snp + PC1). How do the
results compare to the unadjusted analysis?
# empty vector to store p-values
pvals <- c()
# loop through the 15 SNPs
for(i in 1:15){
dat <- pcadata %>%
select(trait, paste0('SNP',i)) %>% # pull out just the trait and SNP of interest
mutate(PC1 = pca_scores[,1]) # add the scores for PC1
mod <- lm(trait ~ ., data = dat) # regress trait on everything else (.), which is SNP i and PC1
pvals[i] <- summary(mod)$coef[2,4] # pull out the p-value for the slope
}
# plot -log10 pvalues
data.frame(p = pvals, SNP = 1:15) %>%
ggplot(aes(y = -log10(p), x = SNP)) +
geom_point() +
theme_minimal() +
ggtitle('Adjusted Analysis')
In this adjusted analysis, SNP3 is now the only one with a particularly small p-value.
The trait that we are using for these GWASs is simulated, so we know the “answers” here. In particular, I created this trait such that it depends on the genotype at SNP3 as well as the population that the person belongs to. Now that you know the “truth”, re-evaluate your results above. Which model is better? Why?
The second model (that adjusts for PC1) is better. In this context, population is a confounding variable: it’s associated with the outcome and it’s also associated with the genotypes at SNPs 1–5. When we don’t adjust for this confounding variable in our unadjusted analysis, we see that SNPs 1 and 2 (which have a particularly strong association with the confounding variable) have very small p-values even though they shouldn’t (since they’re not actually causally related to the trait): in other words, we’d probably have two false positives here (depending on our choice of significance threshold).
Suppose we have three populations instead of just two. How many PCs do we need in order to capture “ancestry” then?
Consider another toy dataset with 20 SNPs and 500 people in each of three populations:
head(pcadata3pop)
## population SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
## 1 1 0 0 0 0 0 0 0 1 1 0 0 1
## 2 1 0 0 0 0 0 0 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0 1 0 2 0 0 0
## 4 1 0 0 0 0 0 2 2 1 0 0 0 1
## 5 1 0 0 0 0 0 1 1 0 1 1 1 0
## 6 1 0 0 0 0 0 0 0 1 0 1 0 0
## SNP13 SNP14 SNP15 SNP16 SNP17 SNP18 SNP19 SNP20
## 1 0 0 0 0 0 0 1 0
## 2 0 1 0 0 1 0 0 0
## 3 0 0 1 0 0 2 0 0
## 4 0 1 0 0 0 0 0 0
## 5 0 0 0 1 0 0 1 0
## 6 0 0 0 0 0 1 1 2
pcadata3pop %>%
count(population)
## population n
## 1 1 500
## 2 2 500
## 3 3 500
Let’s run PCA on these data:
# pull out genotypes
geno3pop <- pcadata3pop %>%
select(-population)
# run PCA
pca_out_3pop <- prcomp(geno3pop, center = T, scale = T)
# pull out scores, loadings, and variance
scores3pop <- pca_out_3pop$x
loadings3pop <- pca_out_3pop$rotation
var3pop <- (pca_out_3pop$sdev)^2
Here’s what our scores look like:
scores3pop %>%
as.data.frame() %>% # convert pca_scores into a data frame for plotting
mutate(population = as.factor(pcadata3pop$population)) %>% # add the population labels
ggplot(aes(x = PC1, y = PC2, color = population)) + # then plot
geom_point() +
scale_color_brewer(palette = 'Dark2')
And here’s the parallel coordinates plot:
# parallel coordinates plot
scores3pop %>%
as.data.frame() %>% # convert pca_scores into a data frame for plotting
mutate(population = as.factor(pcadata3pop$population)) %>% # add the population labels
ggparcoord(columns = 1:20, groupColumn = 'population', alpha = 0.2) + # plot the first 15 columns
theme_minimal() +
scale_color_brewer(palette = 'Dark2')
And here’s the scree plot:
# calculate proportion of variance explained
pve3pop <- var3pop/sum(var3pop)
# scree plot
pve3pop %>%
as.data.frame() %>%
mutate(index = seq_len(length(pve3pop))) %>%
ggplot(aes(x = index, y = pve3pop)) +
geom_point() +
geom_line() +
labs(x = 'SNP Number', y = 'Percent of Variance Explained') +
theme_minimal()
Now it looks like we need 2 PCs to capture population structure!
Many R packages and other software programs have been developed
specifically for applying PCA to genetic data. One of my favorites is
SNPRelate (in part because it was developed by researchers
in the Biostatistics Department at the University of Washington). You
can find a tutorial walking through examples of how to applying the
functions in the package here.